Skip to content

[SPARK-26560][SQL]:Repeat select on HiveUDF fails#23921

Closed
nivo091 wants to merge 1 commit into
apache:masterfrom
nivo091:udf_fix
Closed

[SPARK-26560][SQL]:Repeat select on HiveUDF fails#23921
nivo091 wants to merge 1 commit into
apache:masterfrom
nivo091:udf_fix

Conversation

@nivo091

@nivo091 nivo091 commented Feb 28, 2019

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

The classloader of spark-shell is IMainsTranslatingClassLoader which is loaded from IMain (scala.tools.nsc.interpreter). But on the first select, it registers the function and loads the hiveUDF jar to sparkContext classloader which is NonClosableMutuableClassLoader. While selecting the function on the second time, function is already registered so it tries to fetch from IMainTranslatingClassLoader which is giving analysis exception.

Changing of the classLoader of currentThread to NonClosableMutuableclassLoader will solve this issue.

@HyukjinKwon

Copy link
Copy Markdown
Member

Can you fix the PR title? It sounds like we have a problem in SELECT of SparkSQL. The problem is about Hive UDF in a specific case.

@HyukjinKwon

Copy link
Copy Markdown
Member

Also, please describe how the current PR fixes the issue in the PR description.

@nivo091 nivo091 changed the title SPARK-26560:Repeat select fail fix SPARK-26560:Repeat select on HiveUDF fails in Spark-shell Mar 1, 2019
@nivo091

nivo091 commented Mar 1, 2019

Copy link
Copy Markdown
Contributor Author

I have updated the PR description and title, please review. @HyukjinKwon

@HyukjinKwon HyukjinKwon changed the title SPARK-26560:Repeat select on HiveUDF fails in Spark-shell SPARK-26560:Repeat select on HiveUDF fails Mar 1, 2019
@nivo091

nivo091 commented Mar 1, 2019

Copy link
Copy Markdown
Contributor Author

@HyukjinKwon This issue happens only in Spark-shell, that's why I added in the title. Is that not required ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pull up the flower bracket inline with finally

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this try block is required now? please check once again

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can remove this inside try and put the catch block to the outer try.

@sujith71955

Copy link
Copy Markdown
Contributor

@nivo091 Please correct the PR title format seems to be not in same as standard mentioned by the community.

@nivo091 nivo091 changed the title SPARK-26560:Repeat select on HiveUDF fails [SPARK-26560][SQL]:Repeat select on HiveUDF fails Mar 5, 2019
@nivo091

nivo091 commented Mar 5, 2019

Copy link
Copy Markdown
Contributor Author

@sujith71955 : Handled, thanks

@sujith71955

Copy link
Copy Markdown
Contributor

cc @srowen @HyukjinKwon

@nivo091

nivo091 commented Mar 14, 2019

Copy link
Copy Markdown
Contributor Author

cc @xiao Li

@SparkQA

SparkQA commented Mar 14, 2019

Copy link
Copy Markdown

Test build #4621 has finished for PR 23921 at commit 4855060.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon

Copy link
Copy Markdown
Member

Can you make the PR description concise? It's difficult to read and follow. Why does the classloader change in the second time? Can you also point out the related codes?

@nivo091

nivo091 commented Mar 19, 2019

Copy link
Copy Markdown
Contributor Author

Spark-shell have the IMainTranslatingClassLoader which is loaded from IMain (scala.tools.nsc.interpreter).
For the first select of function, it loads the hiveUDF jar in the below line.

session.sharedState.jarClassLoader.addURL(jarURL)
Thread.currentThread().setContextClassLoader(session.sharedState.jarClassLoader)

@AmplabJenkins

Copy link
Copy Markdown

Can one of the admins verify this patch?

Try(super.makeFunctionExpression(name, clazz, input)).getOrElse {
var udfExpr: Option[Expression] = None
try {
val originalClassLoader = Thread.currentThread().getContextClassLoader()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't need to modify original code but just switch context classloader. Then, use Utils.withContextClassLoader and just pass original code into fn.

@HeartSaVioR

Copy link
Copy Markdown
Contributor

Btw, please follow the template of PR description. And we describe PR title for what the patch fixes instead of what was the bug.

@HeartSaVioR

Copy link
Copy Markdown
Contributor

I'm taking this over, as it's a bit old and the test should be added as well. Please refer #27025.
I've retained the commit to give proper credit. Thanks!

cloud-fan pushed a commit that referenced this pull request Jan 2, 2020
…ardless of current thread context classloader

### What changes were proposed in this pull request?

This patch is based on #23921 but revised to be simpler, as well as adds UT to test the behavior.
(This patch contains the commit from #23921 to retain credit.)

Spark loads new JARs for `ADD JAR` and `CREATE FUNCTION ... USING JAR` into jar classloader in shared state, and changes current thread's context classloader to jar classloader as many parts of remaining codes rely on current thread's context classloader.

This would work if the further queries will run in same thread and there's no change on context classloader for the thread, but once the context classloader of current thread is switched back by various reason, Spark fails to create instance of class for the function.

This bug mostly affects spark-shell, as spark-shell will roll back current thread's context classloader at every prompt. But it may also affects the case of job-server, where the queries may be running in multiple threads.

This patch fixes the issue via switching the context classloader to the classloader which loads the class. Hopefully FunctionBuilder created by `makeFunctionBuilder` has the information of Class as a part of closure, hence the Class itself can be provided regardless of current thread's context classloader.

### Why are the changes needed?

Without this patch, end users cannot execute Hive UDF using JAR twice in spark-shell.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UT.

Closes #27025 from HeartSaVioR/SPARK-26560-revised.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Co-authored-by: nivo091 <nivedeeta.singh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
HeartSaVioR added a commit to HeartSaVioR/spark that referenced this pull request Jan 2, 2020
…ardless of current thread context classloader

This patch is based on apache#23921 but revised to be simpler, as well as adds UT to test the behavior.
(This patch contains the commit from apache#23921 to retain credit.)

Spark loads new JARs for `ADD JAR` and `CREATE FUNCTION ... USING JAR` into jar classloader in shared state, and changes current thread's context classloader to jar classloader as many parts of remaining codes rely on current thread's context classloader.

This would work if the further queries will run in same thread and there's no change on context classloader for the thread, but once the context classloader of current thread is switched back by various reason, Spark fails to create instance of class for the function.

This bug mostly affects spark-shell, as spark-shell will roll back current thread's context classloader at every prompt. But it may also affects the case of job-server, where the queries may be running in multiple threads.

This patch fixes the issue via switching the context classloader to the classloader which loads the class. Hopefully FunctionBuilder created by `makeFunctionBuilder` has the information of Class as a part of closure, hence the Class itself can be provided regardless of current thread's context classloader.

Without this patch, end users cannot execute Hive UDF using JAR twice in spark-shell.

No.

New UT.

Closes apache#27025 from HeartSaVioR/SPARK-26560-revised.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Co-authored-by: nivo091 <nivedeeta.singh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Jan 2, 2020
…r regardless of current thread context classloader

### What changes were proposed in this pull request?

This patch is based on #23921 but revised to be simpler, as well as adds UT to test the behavior.
(This patch contains the commit from #23921 to retain credit.)

Spark loads new JARs for `ADD JAR` and `CREATE FUNCTION ... USING JAR` into jar classloader in shared state, and changes current thread's context classloader to jar classloader as many parts of remaining codes rely on current thread's context classloader.

This would work if the further queries will run in same thread and there's no change on context classloader for the thread, but once the context classloader of current thread is switched back by various reason, Spark fails to create instance of class for the function.

This bug mostly affects spark-shell, as spark-shell will roll back current thread's context classloader at every prompt. But it may also affects the case of job-server, where the queries may be running in multiple threads.

This patch fixes the issue via switching the context classloader to the classloader which loads the class. Hopefully FunctionBuilder created by `makeFunctionBuilder` has the information of Class as a part of closure, hence the Class itself can be provided regardless of current thread's context classloader.

### Why are the changes needed?

Without this patch, end users cannot execute Hive UDF using JAR twice in spark-shell.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UT.

Closes #27075 from HeartSaVioR/SPARK-26560-branch-2.4.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants